CYCLE_J v0.1
Midwestern Simulation
──────────────────────────────────────────────────────────────────────

PIPELINE
────────

┌─────┐
│GPT-J│
└──┬──┘
   │  expand ctx to 4k
   ▼
┌──────────────┐
│INTERMEDIATE-1│
└──┬────────┬──┘
   │        └───────────────────────┐
   ▼                                ▼
┌───────────────┐           ┌───────────────┐
│INTERMEDIATE-2A│           │INTERMEDIATE-2B│
└───────┬───────┘           └───────┬───────┘
        │                           │
        └─► merge models linearly ◄─┘
                     │
                     ▼
              ┌──────────────┐
              │INTERMEDIATE-3│
              └──────┬───────┘
                     │  RL (policy gradients)
                     ▼
                 ┌───────┐
                 │CYCLE-J│
                 └───────┘

INTERMEDIATE-1 → INTERMEDIATE-2A objective:
  maximize p(txt_paraphrased | "{DA}{txt}{DB}")
         + p(txt | "{DB}{txt_paraphrased}{DA}")
  where:
    DA, DB denote markers for domains A, B
    txt_paraphrased is txt paraphrased by Mistral 7B Instruct v0.1
    or Qwen2.5 Instruct

INTERMEDIATE-1 → INTERMEDIATE-2B objective:
  maximize p(txt_A | DA) + p(txt_B | DB)
  where:
    DA, DB are markers for domains A, B
    txt_A, txt_B are texts sampled from domains A, B

INTERMEDIATE-2A + INTERMEDIATE-2B → INTERMEDIATE-3:
  merge the two models linearly (no training)

INTERMEDIATE-3 → CYCLE-J (RL) rewards:
  1. cycle consistency
     a. embedding similarity*
     b. ROUGE, BLEU
  2. discriminator*, trained on real + generated samples during RL
  *based on NeoBERT

CYCLE-J: a model trained to translate between unpaired domains, using
model merging and policy gradients.

IMPLEMENTATION DETAILS
──────────────────────

DEFAULT HYPERPARAMETERS
  BS      : 32
  LR      : 1e-4
  WARM    : 10%
  MAX GRAD: 1.0
  CTX     : 4096
  OPT     : AdamW (betas 0.9, 0.95; eps 1e-6)
  SCHD    : cosine w/ linear warmup
  LORA    : rank 8, alpha sqrt(8)

HYPERPARAMETER DIFFERENCES
  INTERMEDIATE-1 : N/A (no change)
  INTERMEDIATE-2A: N/A (no change)
  INTERMEDIATE-2B: N/A (no change)
  INTERMEDIATE-3 : N/A (no training)
  CYCLE-J        : no LoRA, BS 8, rollouts 16, LR 1e-5, max grad 0.0001

NOTES
─────

I've noticed a few problems of mine that I could very easily solve with a
CycleGAN for text, so that's what CYCLE_J aims to be. I used GPT-J because it's
"low-background": no chat-model interactions in its dataset means no trying to
circumvent refusal behavior in sensitive domains, as well as less filtered
output, which is important when working with real (unfiltered) data in prod.
If you don't model it, why expect good results when trying to transmogrify it
with your model?
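For concreteness, here is a minimal sketch of the INTERMEDIATE-1 → INTERMEDIATE-2A
objective. It assumes an HF causal LM; the literal marker strings for DA/DB and
the -100 label masking are illustrative assumptions, not the exact training code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DA, DB = "<|domain_a|>", "<|domain_b|>"   # assumed marker strings, not from the card

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

def nll_of_target_given_prefix(prefix: str, target: str) -> torch.Tensor:
    # Negative log-likelihood of `target` conditioned on `prefix`;
    # prefix tokens are masked out of the loss.
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100
    return model(input_ids=input_ids, labels=labels).loss

def intermediate_2a_loss(txt: str, txt_paraphrased: str) -> torch.Tensor:
    # Minimizing this sum maximizes both conditional likelihoods from the card.
    return (
        nll_of_target_given_prefix(f"{DA}{txt}{DB}", txt_paraphrased)
        + nll_of_target_given_prefix(f"{DB}{txt_paraphrased}{DA}", txt)
    )

The INTERMEDIATE-2B objective can reuse the same helper with only the domain
marker as prefix: nll_of_target_given_prefix(DA, txt_A) + nll_of_target_given_prefix(DB, txt_B).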
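The merge step is simplest to picture as parameter averaging. The card only says
"merge models linearly", so the 50/50 weighting and the checkpoint paths below
are assumptions.

from transformers import AutoModelForCausalLM

model_2a = AutoModelForCausalLM.from_pretrained("ckpts/intermediate-2a")  # hypothetical path
model_2b = AutoModelForCausalLM.from_pretrained("ckpts/intermediate-2b")  # hypothetical path

state_2b = model_2b.state_dict()
merged_state = {
    # average floating-point weights; copy non-float buffers (e.g. attention masks) as-is
    name: 0.5 * param + 0.5 * state_2b[name] if param.is_floating_point() else param
    for name, param in model_2a.state_dict().items()
}
model_2a.load_state_dict(merged_state)            # reuse 2A's architecture as INTERMEDIATE-3
model_2a.save_pretrained("ckpts/intermediate-3")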
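The RL reward is the part most open to interpretation, so here is one hedged
reading of it: cycle consistency scored by embedding similarity plus ROUGE/BLEU
on the round-trip reconstruction, and a discriminator scoring whether a
translation looks like real target-domain text. A generic BERT encoder stands in
for NeoBERT below, the equal weighting of the terms is my assumption, and
training the discriminator on real + generated samples during RL isn't shown.

import torch
import sacrebleu
from rouge_score import rouge_scorer
from transformers import AutoModel, AutoModelForSequenceClassification, AutoTokenizer

enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in for NeoBERT
encoder = AutoModel.from_pretrained("bert-base-uncased")
disc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
discriminator = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2          # label 1 = "real", label 0 = "generated"
)
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    ids = enc_tok(text, return_tensors="pt", truncation=True)
    return encoder(**ids).last_hidden_state.mean(dim=1).squeeze(0)   # mean-pooled embedding

@torch.no_grad()
def reward(original: str, reconstruction: str, translation: str) -> float:
    # 1a. embedding similarity between the original and its round-trip reconstruction
    sim = torch.cosine_similarity(embed(original), embed(reconstruction), dim=0).item()
    # 1b. surface-level cycle consistency via ROUGE-L and BLEU
    r = rouge.score(original, reconstruction)["rougeL"].fmeasure
    b = sacrebleu.sentence_bleu(reconstruction, [original]).score / 100.0
    # 2. discriminator: probability that the translation looks like real target-domain text
    logits = discriminator(**disc_tok(translation, return_tensors="pt", truncation=True)).logits
    p_real = logits.softmax(dim=-1)[0, 1].item()
    return sim + r + b + p_real                # equal weights (assumed)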
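Finally, the default hyperparameters map onto a fairly standard setup. This
sketch assumes PyTorch + peft; total_steps, and passing LoRA's alpha as the
numeric value sqrt(8) ≈ 2.83, are my assumptions.

import math
import torch
from torch.optim.lr_scheduler import LambdaLR
from peft import LoraConfig

lora_config = LoraConfig(r=8, lora_alpha=math.sqrt(8))   # rank 8, alpha sqrt(8) ≈ 2.83

def make_optimizer_and_scheduler(model, total_steps: int):
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-4, betas=(0.9, 0.95), eps=1e-6
    )
    warmup_steps = int(0.10 * total_steps)               # WARM: 10%

    def cosine_with_linear_warmup(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)           # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    return optimizer, LambdaLR(optimizer, lr_lambda=cosine_with_linear_warmup)

# MAX GRAD 1.0 would be applied each step before optimizer.step(), e.g.
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

For CYCLE-J itself the card swaps these defaults for no LoRA, BS 8, 16 rollouts,
LR 1e-5, and max grad 0.0001.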